Did you think I was going to explain how to convert your own data into a format Datasets can use? Nope. Preparing data is way too much hassle,
so I'll just keep going with the walkthrough.
We can use a variable to record which base model we want to fine-tune. Here I'm going with the small version of Whisper:
selected_model = "openai/whisper-small"
Some of you might ask: why not use medium?
Because not everyone has a powerful GPU. I'll get into what happens if you use medium in the next post.
Next up is feature extraction, which uses the FeatureExtractor.
Since we already defined selected_model above, we can reuse it below instead of hard-coding the model name (just change the variable if you want a different model):
from transformers import WhisperFeatureExtractor
feature_extractor = WhisperFeatureExtractor.from_pretrained(selected_model)
Then the Tokenizer, again loaded from the same pre-trained model:
from transformers import WhisperTokenizer
tokenizer = WhisperTokenizer.from_pretrained(selected_model, language="chinese", task="transcribe")
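If you want to convince yourself the tokenizer handles Chinese correctly, here's a quick round-trip check (a sketch, assuming the common_voice dataset from the previous post is already loaded with its "sentence" column):
# encode a transcript to label ids, then decode it back and compare
input_str = common_voice["train"][0]["sentence"]
labels = tokenizer(input_str).input_ids
decoded_with_special = tokenizer.decode(labels, skip_special_tokens=False)
decoded_str = tokenizer.decode(labels, skip_special_tokens=True)
print(f"Input: {input_str}")
print(f"Decoded w/ special tokens: {decoded_with_special}")
print(f"Decoded w/o special tokens: {decoded_str}")
print(f"Round-trip OK: {input_str == decoded_str}")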
On top of those two there's also the Processor, which bundles the feature extractor and tokenizer into a single object:
from transformers import WhisperProcessor
processor = WhisperProcessor.from_pretrained(selected_model, language="chinese", task="transcribe")
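From here on we mostly only need to pass this one processor around. A quick way to see that it really is just a wrapper (purely illustrative):
# the processor exposes the two components we already built
print(type(processor.feature_extractor))          # WhisperFeatureExtractor
print(type(processor.tokenizer))                  # WhisperTokenizer
print(processor.feature_extractor.sampling_rate)  # 16000, the rate Whisper expects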
Then convert the sampling rate of every audio file to 16 kHz, since that's what Whisper was trained on:
from datasets import Audio
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16000))
print(common_voice["train"][0])
The printed sampling_rate should now show 16000 everywhere.
Now some light preprocessing on the Dataset:
def prepare_dataset(batch):
    # load the audio; the Audio cast above resamples it from 48 kHz to 16 kHz on access
    audio = batch["audio"]
    # compute log-Mel input features from the raw audio array
    batch["input_features"] = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"]).input_features[0]
    # encode the target text to label ids
    batch["labels"] = tokenizer(batch["sentence"]).input_ids
    return batch
common_voice = common_voice.map(prepare_dataset, remove_columns=common_voice.column_names["train"], num_proc=1)
One thing to note here: this originally used num_proc=4. If the run errors out or your machine isn't that powerful, dropping it to 1 will probably go smoother.
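To confirm the map actually worked, you can peek at one processed example (a sketch; the exact label length depends on your data). The log-Mel features should come out as an 80 x 3000 array, i.e. 80 mel bins over 30 seconds of audio, and the labels should decode back to the transcript:
import numpy as np
sample = common_voice["train"][0]
# Whisper pads/truncates every clip to 30 seconds, so the features have a fixed shape
features = np.array(sample["input_features"])
print(features.shape)  # (80, 3000)
# the label ids should decode back to the original sentence
print(tokenizer.decode(sample["labels"], skip_special_tokens=True))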
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union
@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have different lengths and need different padding methods
        # first handle the audio inputs by simply returning torch tensors
        input_features = [{"input_features": feature["input_features"]} for feature in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to the longest sequence in the batch
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 so those positions are ignored by the loss
        labels = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)

        # if a bos token was prepended in the tokenization step above,
        # cut it here since it gets added again later anyway
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)
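Before moving on, you can sanity-check the collator by hand-feeding it a couple of examples (again a sketch, assuming the mapped common_voice from above):
batch = data_collator([common_voice["train"][i] for i in range(2)])
print(batch["input_features"].shape)  # (2, 80, 3000): the inputs are already fixed-length
print(batch["labels"].shape)          # (2, longest label sequence in the batch)
print(batch["labels"][0])             # padded positions show up as -100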
That's it for now!
A reminder again: everything in these posts comes from this article.
Coming up to Taipei is exhausting. Three days left!